From Text to Embeddings
Understanding NLP Representations
The Core Question
"How do machines understand human language?"
The Journey We'll Take
Text ("I love ML") → Tokens (["I", "love", "ML"]) → Numbers ([245, 1089, 3421]) → Embeddings ([0.2, -0.5, 0.8, ...]) → Model (understanding) → Output (predictions)
This module demystifies each step of this transformation
What You'll Master
The Why
Why models need numbers
The Evolution
From BoW to Transformers
The Mechanics
How tokenizers work
The Choices
When to use what
The Practice
Hands-on workflows
The Integration
Connecting to LLMs
Your Learning Path (7 Parts)
Foundation
Why do models need numbers? What makes text unique?
Evolution
BoW → TF-IDF → Word2Vec → Contextual Embeddings
Mechanics
How embeddings emerge, tokenization deep dive
Decisions
When to use embeddings? Build vs pretrained?
Guidelines
What to watch out for, recommended workflows
Integration
Complete pipeline, connecting to modern LLMs
Practice
Hands-on Jupyter notebook with real data
Follow the path to master text representations
Part 1: The Foundation
The 3 Questions We'll Answer
Why Numbers?
Why do models need numeric inputs?
Why Is Text Different?
What makes NLP uniquely challenging?
The Core Problem?
What challenge must we solve?
Your Path Through Part 1
Section 1: What Is a Model? (models need numbers) → Section 2: Features vs Representations (text is unique) → Section 3: The Text Problem (the challenge defined)
Real Examples
Weather, Spam, Recommendations
Visual Comparisons
Structured vs Images vs Text
Key Insights
The double learning problem
Section 1: What Is a Model?
Before we dive into text, let's understand what models actually do.
The Critical Insight
Machine learning models must work with numbers because they use mathematical operations (multiplication, addition, derivatives) to learn patterns. Text like "hot" or "sunny" cannot be directly multiplied or differentiated.
Left: With numbers, models can perform math operations and learn. Right: Text cannot be used in mathematical operations directly.
Let's see this principle in action with three everyday examples:
Inputs (All Numbers)
- Temperature: 23°C
- Humidity: 65%
- Pressure: 1013 hPa
Model processes these numbers
Output (A Number)
Rain Probability: 0.7 (70% chance)
Inputs (All Numbers)
- Word count: 45
- Has "urgent": 1
- Link count: 3
Model processes these numbers
Output (A Number)
Spam Score: 0.9 (90% spam)
Inputs (All Numbers)
- User age: 28
- Past purchases: 15
- Time on page: 120s
Model processes these numbers
Output (A Number)
Interest Score: 0.85 (85% likely)
What Do All These Have in Common?
Numeric Inputs (temperatures, counts, ages) → Mathematical Operations (multiply, add, gradients) → Numeric Outputs (probabilities, scores)
The Pattern: All models follow the same principle:
Numbers In → Math Processing → Numbers Out
Deep Dive: How Models Learn From Numbers
Models don't just process numbers; they learn patterns by adjusting parameters through gradient descent.
Left: A model finding patterns in numeric data (temperature → rain probability). Right: Gradient descent optimizing model parameters, which requires numeric derivatives!
Key Point: Learning requires computing gradients (derivatives), which only work with numbers. This is why "converting text to numbers" isn't just preprocessing; it's the fundamental bridge that enables learning.
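The role of gradients can be made concrete. Below is a minimal sketch (toy numbers, a single weight, plain Python, no ML library) of gradient descent fitting the rule y = 2x from numeric data; every step relies on arithmetic that simply has no meaning for raw strings:

```python
# Toy gradient descent: learn w so that w * x fits the data.
# Every step below is arithmetic on numbers; none of it is defined
# for raw strings like "hot" or "sunny".
xs = [1.0, 2.0, 3.0]   # numeric inputs (e.g. normalized temperatures)
ys = [2.0, 4.0, 6.0]   # numeric targets (the true rule is y = 2x)

w = 0.0                # initial guess for the single weight
lr = 0.05              # learning rate
for _ in range(200):
    # derivative of mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad     # gradient descent update

print(round(w, 2))     # → 2.0 (the model recovered the pattern)
```

The update line `w -= lr * grad` is exactly the operation that is impossible when the input is text rather than numbers.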
Check Your Understanding
Q1: Why do machine learning models require numeric inputs?
Q2: What makes NLP uniquely challenging compared to structured data?
Great! You now understand that models need numbers.
But here's the critical question: Are all data types equally easy to convert to numbers?
Section 2: The Unique Challenge of Text - Features vs Representations
Text is fundamentally different from other data types. Let's see why.
Comparing Three Data Types
Understanding how different data types are processed helps us see why NLP is unique
Top (Blue): Structured data features are given. Middle (Orange): Image features are learned implicitly by CNNs. Bottom (Purple): Text representations MUST be explicitly learned; this is the critical difference!
The Critical Difference: The Double Learning Problem
Structured data: features given → learn relationships
Images: pixels given → CNN learns features → learn relationships
Text: symbols given → MUST learn representations → learn relationships
Left: Structured data has ONE learning problem (relationships). Right: Text has TWO learning problems (representations + relationships); this is unique to NLP!
Why This Matters: This double learning problem is why preprocessing and representation choices are so critical in NLP. Choose the wrong representation and the model can't learn the task effectively, no matter how sophisticated it is!
Two Approaches: Who Decides the Numbers?
Left (Orange): Feature Engineering, where YOU manually design what numbers to extract. Right (Purple): Representation Learning, where the MODEL automatically learns the best numbers.
Feature Engineering
Who decides: Human expert
Output: Sparse vectors (mostly zeros)
Example: [5, 1, -1, 0]
You manually count words, negations, etc.
Representation Learning
Who decides: Learning algorithm
Output: Dense vectors (all values meaningful)
Example: [-0.23, 0.45, ..., 0.92] (768D)
Model learns patterns from data automatically
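To make the contrast concrete, here is a small sketch of the two approaches on a sentiment-style input (the feature choices, lexicons, and all numbers are invented for illustration):

```python
# Approach 1: feature engineering. A human decides which numbers to
# extract; each slot of the vector has a human-readable meaning.
def hand_features(text):
    words = text.lower().split()
    positive = {"good", "great", "fantastic"}   # hand-picked lexicons
    negative = {"bad", "boring", "terrible"}
    return [
        len(words),                          # word count
        sum(w in positive for w in words),   # positive-word count
        -sum(w in negative for w in words),  # negative-word count (negated)
        int("not" in words),                 # negation flag
    ]

print(hand_features("The movie was fantastic not boring"))
# → [6, 1, -1, 1]

# Approach 2: representation learning. The model owns the numbers;
# a learned embedding is a dense vector with no human-readable slots
# (illustrative 4-D stand-in for a real 768-D vector):
learned = [-0.23, 0.45, 0.12, 0.92]
```

Notice that the hand-built vector is interpretable but brittle (it misses every word outside its lexicons), while the learned vector is opaque but shaped by the data itself.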
Why This Matters for NLP
1. Double Learning Challenge
We're learning two things simultaneously: What are good representations? (embedding layer) + How to use them? (task layers)
2. Quality Is Critical
Bad representations → model can't learn well
Good representations → model learns easily
3. You Must Decide
Unlike structured data (features given) or images (CNN handles it), in NLP YOU must choose how to convert text to numbers
Quick Comparison Across Domains
Structured Data
Input: Numbers
Features: Given by data
Learns: Relationships only
Challenge: Algorithm choice
Images
Input: Pixels (numbers)
Features: CNN extracts automatically
Learns: Features + Relationships
Challenge: CNN architecture
Text/NLP
Input: Symbols (text)
Features: MUST learn explicitly
Learns: Reps + Features + Relations
Challenge: Learning representations
Why we spend so much time on text-to-numbers:
Unlike structured data (features given) or images (conv handles it), in NLP YOU must decide how to convert text to numbers.
- Choose the wrong representation → the model fails (can't learn)
- Choose the right representation → the model succeeds (learns patterns)
This module teaches you to make that choice wisely.
The rest of the module answers: "What are good representations and how do we create them?"
Section 3: The Text Problem
Now we know models need numbers and text is uniquely challenging. So what's the actual problem?
The Challenge
Consider this movie review:
"This movie was absolutely fantastic! The acting was superb and the plot kept me engaged throughout."
How do we give this to a model?
Text ("fantastic movie...") → ??? (convert to numbers?) → Model (needs numbers) → Positive/Negative
The ??? represents the critical challenge we need to solve!
Models Need Numbers
For gradient descent and learning
Text Is Symbolic
Words, letters, punctuation: not numbers
Need a Bridge
Text → Numbers
The Critical Constraint
The numbers we create must preserve meaning. If we lose meaning in the conversion, the model can't learn useful patterns, no matter how sophisticated it is!
Left (Bad): ASCII codes lose all semantic meaning, so the model can't learn. Right (Good): Semantic embeddings preserve meaning, so the model can learn patterns.
Key Insight: Not all numbers are equal! The quality of your text-to-number conversion determines everything downstream.
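A quick way to see the "not all numbers are equal" point: character codes do turn text into numbers, but the resulting geometry is meaningless. A tiny sketch:

```python
# ASCII/Unicode code points ARE numbers... but arbitrary ones.
good_codes  = [ord(c) for c in "good"]
gold_codes  = [ord(c) for c in "gold"]
great_codes = [ord(c) for c in "great"]

print(good_codes)   # → [103, 111, 111, 100]
print(gold_codes)   # → [103, 111, 108, 100]
print(great_codes)  # → [103, 114, 101, 97, 116]

# "good" and "gold" are numerically near-identical yet unrelated in
# meaning, while the synonyms "good" and "great" get wildly different
# codes. The numbers exist, but they don't preserve meaning, so a
# model can't learn useful patterns from them.
```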
The Question We Must Answer
"How do we convert language into numbers?
And which numbers?
There are infinite ways to do thisβwhich approach makes sense?"
The Text-to-Numbers Problem: Possible Approaches
- Bag of Words (counts): + simple / − ignores word order
- TF-IDF (weighted counts): + informative / − still sparse
- Word2Vec (dense): + semantic / − static
- Contextual (BERT): + context-aware / − expensive
- Character-level: + fine-grained, no OOV / − too granular
- Subword (BPE): + balanced / − complex
- N-grams (sequences): + expressive / − feature space grows quickly
- Hashing (encoding): + fast / − collisions
Each approach has trade-offsβthere is no single 'best' solution!
Multiple approaches exist: Bag of Words, TF-IDF, Word2Vec, BERT, Character-level, Subword, N-grams, Hashingβeach with trade-offs!
Why This Matters
Bad representation: the model can't learn patterns, no matter how good the architecture
Good representation: the model learns easily, even with a simple architecture
This isn't just preprocessing; it's the foundation of everything in NLP.
The rest of this module teaches you which representations to choose and why.
Part 1 Complete!
You now understand: WHY models need numbers, WHY text is uniquely challenging, and WHAT problem we're solving.
Part 2: The Evolution Story
From Simple Counting to Semantic Understanding
Now that we know why we need to convert text to numbers, let's explore how this problem has been solved over the past 30+ years. Each approach built on the limitations of the previous one.
The 30-Year Evolution Timeline
1990s: Bag of Words (simple counting) → 2000s: TF-IDF (smart weighting) → 2013+: Word2Vec/GloVe (semantic vectors) → 2018+: BERT/contextual embeddings (context-aware) → 2020+: modern LLMs (same principles, more sophisticated)
Each era solved specific problems but introduced new challenges
Your Journey Through Part 2
Section 4: Bag of Words
The simplest approach - just count!
Era: 1990s-2000s
Type: Sparse, discrete counts
Section 5: TF-IDF
Smarter counting with weights
Era: 2000s
Type: Weighted sparse vectors
Section 6: Word2Vec/GloVe
The semantic leap - dense vectors
Era: 2013+
Type: Dense, semantic embeddings
Section 7: Contextual (BERT)
Context matters - dynamic meaning
Era: 2018+
Type: Contextual embeddings
Important: What You'll Learn
- Each approach has trade-offs; there is no single "best" solution
- Embeddings are learned from data, not magic
- Vector arithmetic (king/queen) works for Word2Vec but NOT universally
- Modern LLMs use the same core principles, just more sophisticated
Section 4: The Simplest Approach - Bag of Words
Let's start with the most intuitive idea: just count the words!
The Core Idea
Bag of Words treats text as an unordered collection of words. We simply count how many times each word appears. It's like throwing all the words into a bag, forgetting their order, and counting them.
Bag of Words: From Text to Count Matrix
Input text: "I love NLP"
→ tokenize → tokens: ["I", "love", "NLP"]
→ build vocab → vocabulary: {"I": 0, "love": 1, "NLP": 2}
→ count → count vector: [1, 1, 1]
Corpus: ["I love NLP", "I love ML"]
Result: each doc becomes a vector of word counts
→ Sparse, high-dimensional, but interpretable
The Step-by-Step Process
Step 1: Tokenization
Split text into individual words (tokens)
"I love NLP" → ["I", "love", "NLP"]
Step 2: Build Vocabulary
Collect all unique tokens from the entire corpus
{"I": 0, "love": 1, "NLP": 2, "ML": 3, ...}
Step 3: Count Occurrences
For each document, count how many times each vocab word appears
Document vector: [1, 1, 1, 0, 0, ...]
Code Example with sklearn
from sklearn.feature_extraction.text import CountVectorizer
# Documents
docs = [
"I love machine learning",
"I love coding",
"machine learning is amazing"
]
# Create and fit vectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
# Vocabulary (sorted alphabetically)
print("Vocabulary:", vectorizer.get_feature_names_out())
# Output: ['amazing' 'coding' 'is' 'learning' 'love' 'machine']
# BoW Matrix (sparse by default, converting to dense for display)
print(X.toarray())
# Output:
# [[0 0 0 1 1 1]   ← Doc 1: "I love machine learning"
#  [0 1 0 0 1 0]   ← Doc 2: "I love coding"
#  [1 0 1 1 0 1]]  ← Doc 3: "machine learning is amazing"
# Notice: Most values are 0 (sparse!)
Strengths vs Limitations
The Fatal Flaw: "The movie was not good" vs "The movie was good"
BoW produces nearly identical vectors because word order is lost! This is why we needed better approaches.
When to Use BoW:
- Quick baseline for classification tasks (surprisingly effective!)
- Document similarity with controlled vocabulary
- When interpretability matters (can see which words drove the decision)
- Limited computational resources
- Topic modeling and keyword extraction
Try it: run CountVectorizer on the movie reviews dataset and see how well it performs as a baseline.
Section 5: Smarter Counting - TF-IDF
Not all words are equally informative. TF-IDF weighs words by importance.
The Problem with Raw Counts
In BoW, common words like "the", "is", "a" get high counts but tell us little. Rare, specific words like "brilliant" or "terrible" are more informative for sentiment analysis.
The Core Insight: Frequent words across all documents matter less. Rare but present words matter more.
How TF-IDF Works
TF-IDF = Term Frequency × Inverse Document Frequency
It's a two-part formula that balances local frequency (in the document) with global rarity (across all documents).
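The formula can be computed by hand. Below is a minimal sketch using the plain (unsmoothed) definition on a toy corpus; real libraries differ in details, e.g. sklearn's TfidfVectorizer adds smoothing and L2 normalization:

```python
import math

docs = [
    "the movie was brilliant",
    "the movie was fine",
    "the plot was thin",
]

def tf_idf(term, doc, corpus):
    words = doc.split()
    tf = words.count(term) / len(words)           # local frequency in the doc
    df = sum(term in d.split() for d in corpus)   # how many docs contain it
    idf = math.log(len(corpus) / df)              # global rarity
    return tf * idf

print(round(tf_idf("the", docs[0], docs), 3))
# → 0.0  ("the" is in every doc, so idf = log(1) = 0)
print(round(tf_idf("brilliant", docs[0], docs), 3))
# positive: rare across the corpus, so it keeps its weight
```

This is exactly the reweighting effect described above: the ubiquitous "the" is zeroed out, while the rare, informative "brilliant" survives.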
Example Comparison: The Reweighting Effect
Notice: "brilliant" has only 2 occurrences but gets the highest TF-IDF score (0.89) because it's rare and informative!
The Impact: Why TF-IDF Matters
BoW vs TF-IDF: Decision Guide
Use Bag of Words When:
- You need a quick baseline
- Vocabulary is small & controlled
- Speed is critical
- You want maximum interpretability
- Simple classification tasks
Use TF-IDF When:
- Common words are drowning signal
- Information retrieval / search systems
- Document similarity / classification
- You need better feature quality
- Keyword extraction tasks
Without TF-IDF
doc1 = "the the the movie"
doc2 = "the the the film"
BoW weights: [3, 1] and [3, 1]
# "the" dominates (75% of features)
# Can't distinguish docs well
Problem: Common words overwhelm signal
With TF-IDF
doc1 = "the the the movie"
doc2 = "the the the film"
TF-IDF weights: [0.2, 0.8] and [0.2, 0.8]
# "the" downweighted (20%)
# "movie"/"film" emphasized (80%)
Solution: Informative words dominate
Try it: compare CountVectorizer with TfidfVectorizer and see which performs better on movie review sentiment classification.
Common Mistake: Data Leakage in Vectorization
The Error:
# WRONG! Fitting on all data
vectorizer = TfidfVectorizer()
all_vectors = vectorizer.fit_transform(all_texts)  # leaks test vocabulary and IDF stats
X_train = all_vectors[:800]
X_test = all_vectors[800:]
Why It's Wrong: The vectorizer sees test data during fit(), learning vocabulary and IDF weights from test set. This leaks information!
The Fix:
# CORRECT! Fit only on train
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # fit on train only
X_test = vectorizer.transform(test_texts)        # transform test with train vocab
Impact: Data leakage can inflate test accuracy by 5-10%, leading to production failures!
Check Your Understanding
Q1: What's the key difference between BoW and TF-IDF?
Q2: When does TF-IDF help most?
Section 6: The Semantic Leap - Dense Word Embeddings
What if words could be represented as dense vectors that capture meaning?
From Sparse to Dense: The Paradigm Shift
Instead of sparse vectors with mostly zeros (BoW/TF-IDF), embeddings are dense vectors (typically 100-300 dimensions) where every single value is meaningful.
The Sparse Vectors Problem
- Dimension = vocabulary size (50,000+)
- 99.99% zeros (wasted space)
- No semantic relationships
- "good" and "great" are unrelated
The Dense Embeddings Solution
- Fixed dimensions (100-300)
- 100% dense (every value matters)
- Captures semantic meaning
- "good" and "great" are similar!
The Magic Property: Semantic Similarity
Words with similar meanings have similar vectors! This is the breakthrough that made embeddings revolutionary.
This was IMPOSSIBLE with BoW/TF-IDF! Sparse methods treated all words as equally unrelated. Embeddings capture that "king" and "queen" are similar concepts, while "king" and "dog" are not.
Measuring Similarity: Cosine Similarity
We measure semantic similarity using the cosine similarity between vectors (range: -1 to 1, where 1 means identical).
from sklearn.metrics.pairwise import cosine_similarity
# Illustrative scores from pretrained word vectors (not runnable as-is):
# similarity("king", "queen")   = 0.72  # High! Similar concepts
# similarity("king", "monarch") = 0.68  # High! Synonyms
# similarity("king", "apple")   = 0.03  # Low! Unrelated
# BoW/TF-IDF would treat "king-queen" and "king-apple"
# as equally unrelated (both have zero overlap)
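Cosine similarity itself is easy to compute. A self-contained sketch with made-up 3-D vectors (real word vectors have hundreds of dimensions):

```python
import math

def cosine(a, b):
    # dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy vectors: "king" and "queen" point in similar directions, "apple" doesn't.
king  = [0.9, 0.8, 0.1]
queen = [0.8, 0.9, 0.2]
apple = [0.1, 0.0, 0.9]

print(round(cosine(king, queen), 2))   # high, close to 1
print(round(cosine(king, apple), 2))   # low, close to 0
```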
Popular Static Embedding Models (an embedding is a dense numeric vector representation of text that captures semantic meaning in continuous space)
Three foundational approaches, each with different training objectives:
Word2Vec
Training Objective: Predict context words from the target word (Skip-gram) or the target from context (CBOW)
Example Model: GoogleNews-vectors-negative300
Key Insight: Words appearing in similar contexts get similar vectors
Dimensions: 300
Vocabulary: 3M words
GloVe
Training Objective: Factorize the global word co-occurrence matrix
Example Model: glove.6B.300d
Key Insight: Combines global statistics with local context
Dimensions: 50/100/200/300
Vocabulary: 400K words
FastText
Training Objective: Like Word2Vec but with subword n-grams
Example Model: cc.en.300.bin
Key Insight: Handles rare/OOV words better using character n-grams
Dimensions: 300
Vocabulary: 2M words
How Are Embeddings Learned?
Embeddings aren't magic; they're learned through training! Let's visualize how the training process works:
How Embeddings Are Learned: The Training Process
1. Context window. Sentence: "the king rules the land". Target: "king". Context: ["the", "rules"]
2. Lookup. Fetch the "king" embedding (300 dims)
3. Predict. Predict the context words: "the", "rules"
4. Compute loss. How well did we predict the context? Loss = prediction error; high loss = bad embedding
5. Update embedding. Backpropagation adjusts the vectors: embedding ← embedding − gradient. Vectors move closer for similar contexts
Repeat millions of times!
The Key Insight
Words appearing in similar contexts get updated in similar ways.
"king" and "queen" both appear near "the ___ ruled" and "___ of England"
→ their embeddings become similar through training!
The Core Idea: Words that appear in similar contexts get similar embeddings. "king" and "queen" both appear near phrases like "the ___ ruled" and "___ of England", so their vectors become similar through millions of training iterations!
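The "similar contexts → similar vectors" principle doesn't even need a neural network to demonstrate: raw co-occurrence counts (the signal that Word2Vec and GloVe distill) already pull "king" and "queen" together. A self-contained sketch on a toy corpus:

```python
import math
from collections import defaultdict

corpus = [
    "the king ruled the land",
    "the queen ruled the land",
    "king of england",
    "queen of england",
    "i ate an apple and a banana",
]

# For each word, count which neighbors appear in a +/-1 window.
vocab = sorted({w for s in corpus for w in s.split()})
cooc = defaultdict(lambda: defaultdict(int))
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j in (i - 1, i + 1):
            if 0 <= j < len(words):
                cooc[w][words[j]] += 1

def vec(word):
    # a word's "vector" = its neighbor counts over the whole vocabulary
    return [cooc[word][c] for c in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# "king" and "queen" share contexts ("the ___ ruled", "___ of england"),
# so their count vectors are parallel; "apple" shares none of them.
print(round(cosine(vec("king"), vec("queen")), 2))  # → 1.0
print(round(cosine(vec("king"), vec("apple")), 2))  # → 0.0
```

Real embedding training replaces these raw counts with learned dense vectors, but the driving signal is the same.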
Vector Arithmetic: The Famous "King - Queen" Example
Static embeddings (especially Word2Vec) show fascinating linear patterns in semantic space:
king − man + woman ≈ queen
What this means: in vector space, these words combine to transform meaning:
- Royalty: "king" keeps royal status → still royalty
- Gender: subtract male, add female → gender transformation
- Result: female + royalty → "queen"
Other Famous Examples:
Paris − France + Germany ≈ Berlin
walking − walk + swim ≈ swimming
Why Does This Work?
- Gender is captured as a vector direction
- Royalty is preserved through the operation
- Relationships are encoded as vector offsets
- Training objective naturally creates these linear patterns
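The analogy can be made concrete with hand-crafted 2-D vectors, where one axis encodes gender and the other royalty (purely illustrative: real Word2Vec directions are learned, not designed):

```python
# Axis 0: gender (+1 male, -1 female); axis 1: royalty (1 royal, 0 not).
man   = [ 1.0, 0.0]
woman = [-1.0, 0.0]
king  = [ 1.0, 1.0]
queen = [-1.0, 1.0]

result = [k - m + w for k, m, w in zip(king, man, woman)]
print(result)  # → [-1.0, 1.0], which equals queen
# Subtracting "man" removes the male direction, adding "woman" inserts
# the female one, and the royalty component is untouched.
```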
Important Caveat:
Vector arithmetic works best with static embeddings like Word2Vec trained on specific objectives. This property does NOT universally transfer to all embedding types!
Use this as intuition-building, not as a guarantee across all models. (We'll see why when we discuss contextual embeddings next.)
When to Use Static Embeddings
Good use cases:
- Similarity search and clustering
- Lightweight text classification with limited data
- Feature extraction for downstream models
- Fast inference requirements
Try it: load pretrained vectors with gensim and explore similarity, analogies, and visualization.
Section 7: Context Matters - Contextual Embeddings
Static embeddings have a problem: words mean different things in different contexts.
The Polysemy Problem: Why Static Embeddings Fail
Consider the word "bank" - it has completely different meanings in different contexts:
The Polysemy Problem: Same Word, Different Meanings
The word: "bank"
Context 1 (financial): "I deposited money at the bank"
Meaning: financial institution
Related: money, deposit, account
Context 2 (geographic): "We sat by the river bank"
Meaning: edge of a river
Related: river, shore, water
The Static Embeddings Problem
"bank" always gets THE SAME vector regardless of context!
vector("bank") = [0.23, -0.15, 0.89, ...] (always identical)
Cannot distinguish between financial institution and river edge
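For intuition about "different vector per context" (this is NOT how BERT works internally; real models use attention over many deep layers), here is a deliberately crude sketch in which a word's contextual vector is just the average of its static vector and its neighbors' vectors from a made-up lookup table:

```python
# Toy static table: every word has ONE fixed vector (illustrative values).
static = {
    "bank":    [0.5, 0.5],
    "money":   [1.0, 0.0],
    "deposit": [0.9, 0.1],
    "river":   [0.0, 1.0],
    "shore":   [0.1, 0.9],
}

def contextual(word, sentence):
    # Crude "contextualization": mix the word with its in-vocab neighbors.
    neighbors = [static[w] for w in sentence if w != word and w in static]
    vecs = [static[word]] + neighbors
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(2)]

v1 = contextual("bank", ["deposit", "money", "at", "the", "bank"])
v2 = contextual("bank", ["the", "river", "bank", "shore"])
print(v1)  # pulled toward the financial direction
print(v2)  # pulled toward the river direction
# A static lookup would return static["bank"] both times; the contextual
# version produces two different vectors for the same word "bank".
```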
The Solution: Contextual Embeddings
Models like BERT and GPT solve this by generating different embeddings for the same word based on surrounding context. This is the major breakthrough that enabled modern NLP!
How Contextual Embeddings Work: The Attention Mechanism
Models like BERT use Transformer architectures with an attention mechanism that allows each word to "look at" all other words in the sentence.
The Power of Attention: In "The movie was not good", the words "not" and "good" have strong attention to each other. This allows the model to understand that "not good" = negative sentiment, something static embeddings could never capture!
Static vs Contextual: Key Differences
Static Embeddings
Word2Vec, GloVe, FastText
- Representation: one fixed vector per word
- Polysemy: cannot distinguish meanings
- Training: shallow models, faster
- Vector arithmetic: clean linear patterns
- Use case: similarity, clustering, fast inference
Contextual Embeddings
BERT, GPT, Modern LLMs
- Representation: a different vector per context
- Polysemy: handles multiple meanings naturally
- Training: deep Transformers, slower
- Vector arithmetic: less consistent (context-dependent)
- Use case: complex NLP tasks, fine-tuning
CRITICAL: Interpretability Caveat
This is one of the most important concepts to understand when working with embeddings!
Interpretability Caveat: Vector Arithmetic
Static embeddings (Word2Vec / GloVe):
king − man + woman ≈ queen
Why it works: one fixed vector per word; trained for semantic relationships
Contextual embeddings (BERT / GPT / modern LLMs):
king − man + woman = ???
Why it's inconsistent: a different vector per context; trained for task performance
Why Vector Arithmetic Doesn't Transfer to Contextual Models
Training objective: BERT/GPT optimize for masked-token or next-token prediction, NOT explicit semantic relationships like Word2Vec
Context dependence: the same word gets different vectors in different contexts, so no single "king" vector exists to manipulate
Design tradeoff: contextual models prioritize task performance over interpretable linear structure
Use vector arithmetic as pedagogical intuition for static embeddings, not as a universal embedding property!
CRITICAL Teaching Point:
The vector arithmetic intuition (king − man + woman ≈ queen) is a beautiful property of static embeddings like Word2Vec, but it does NOT universally transfer to all embedding types!
Why the README emphasizes this: It's tempting to overgeneralize this property to all embeddings, but contextual models work fundamentally differently. Use vector arithmetic as pedagogical intuition for static embeddings, not as a guarantee across all embedding families!
Popular Contextual Models
Models to explore:
- BERT (bert-base-uncased): bidirectional, excellent for understanding tasks
- GPT (gpt2): left-to-right, excellent for generation
- Sentence-BERT (all-MiniLM-L6-v2): optimized for sentence similarity
Try it: use sentence-transformers to generate contextual embeddings and compare them with static Word2Vec embeddings.
Understanding the Mechanics
Now that you know what embeddings are and why they evolved, let's understand how they actually work under the hood.
How do embeddings learn?
From random numbers to meaningful vectors through training
What shapes embeddings?
Training data, objectives, and domain dependencies
Why tokenization matters?
The critical first step that determines everything
Your Journey Through Part 3
Section 8: How Embeddings Emerge (training process demystified) → Section 9: Tokenization Introduction (the critical first step) → Section 10: Building Tokenizers (custom tokenization from scratch) → Section 11: Tokenizer Impact (effects on embeddings & tasks)
What You'll Understand
- How embeddings start as random numbers and become meaningful through training
- Why different training objectives (Word2Vec vs BERT) create different embedding spaces
- How tokenization choices affect vocabulary size, OOV handling, and sequence length
- The tradeoffs between word-level, character-level, and subword tokenization
Section 8: How Embeddings Emerge from Training
Embeddings aren't magic; they're learned parameters shaped by data and objectives.
The Learning Process: From Random to Meaningful
Embeddings start as completely random numbers and gradually evolve through training to capture meaningful semantic patterns. Let's visualize this transformation!
From Random Numbers to Meaningful Vectors
Before training: random vectors for "king", "queen", "apple"
Similarity("king", "queen") = 0.03 → not similar at all!
Training on millions of examples...
After training: meaningful vectors
Similarity("king", "queen") = 0.85 → now very similar!
How It Happens:
1. Similar contexts: "king" and "queen" appear in similar contexts ("the ___ ruled", "___ of England")
2. Prediction task: the model learns to predict context from the word, and similar contexts need similar vectors!
3. Gradient updates: backpropagation adjusts the vectors, so similar usage → similar vectors
Core principle: similar usage patterns in training → similar vectors in the learned space
This is why embeddings capture semantic relationships without explicit programming!
The Magic of Training: Through millions of examples, the model learns that "king" and "queen" appear in similar contexts ("the ___ ruled", "___ of England"). To predict these contexts accurately, their vectors must become similar!
The Training Process (5 Steps)
Step 1: Initialize (each word gets a random vector) → Step 2: Training objective (predict context, reconstruct, etc.) → Step 3: Compute loss (how wrong are the predictions?) → Step 4: Backpropagation (update vectors to reduce loss) → Step 5: Repeat (millions of times) → Result: meaningful embeddings (similar words → similar vectors)
Training Dynamics
Training dynamics: Loss decreases rapidly initially, then converges around epoch 35. Validation loss tracks training closely, indicating good generalization.
Key Insight: What Shapes Embeddings?
Embeddings are NOT universal truth; they are shaped by three key factors during training:
What Shapes Embeddings? Three Key Dependencies
Embeddings are shaped by how they were trained, NOT universal truth
Training Data Dependency
The corpus determines patterns learned
- Wikipedia: general world knowledge, formal language
- Medical journals: clinical terminology, disease names
- Twitter/social media: informal, slang, abbreviations, emojis
- Programming repos: code syntax, technical terms
Same word, different embeddings!
"Python" + snake (0.65) vs "Python" + Java (0.82)
Training Objective Dependency
The task determines geometric structure
- Word2Vec: context prediction → clean linear patterns
- GPT: next token → generation patterns
- BERT: masked tokens → contextual understanding
- Sentence-BERT: similarity → sentence-level clusters
Different objectives = different geometry!
Word2Vec: king − man + woman works | BERT: less reliable
Domain Context Dependency
Specialized corpora emphasize domain meanings
- Finance corpus: "bank" → institution, deposits, loans
- Gaming corpus: "bank" → money storage, vault
- Geography corpus: "bank" → river edge, shore, waterside
- General corpus: "bank" → mixed meanings, less specific
Domain specialization matters!
Finance model: bank + loan (0.89) | Geography: bank + river (0.91)
Critical Implication
This is why choosing the RIGHT pre-trained model for your domain matters!
A model trained on Wikipedia ≠ a model trained on Twitter ≠ a model trained on medical texts
Critical Implication: This is why choosing the right pre-trained model for your domain matters!
A model trained on Wikipedia will have different embeddings than one trained on Twitter or medical texts, even with the same architecture.
Embedding Dimensions: Finding the Sweet Spot
The sweet spot: 300 dimensions balances accuracy (89%) with reasonable training time (60s). Beyond that, diminishing returns.
Common Training Objectives
Different training objectives lead to different embedding spaces. Here are the most popular approaches:
Context Prediction
Model: Word2Vec (Skip-gram/CBOW)
What It Learns: Words in similar contexts β similar vectors
Example: Predict "cat" from ["The", "sat", "on"]
Co-occurrence Factorization
Model: GloVe
What It Learns: Global word relationships from statistics
Example: "king" and "queen" co-occur often
Masked Language Modeling
Model: BERT
What It Learns: Bidirectional context understanding
Example: Predict "[MASK]" in "The cat [MASK] on mat"
Next Token Prediction
Model: GPT
What It Learns: Left-to-right generation patterns
Example: Given "The cat sat", predict "on"
Contrastive Similarity
Model: Sentence-BERT
What It Learns: Sentence-level semantic similarity
Example: "The movie was great" should be close to "The film was excellent"
Section 9: Tokenization - The Critical First Step
Before we can create embeddings, we must decide: how do we split text into tokens?
Why Tokenization Matters
Tokenization is NOT just preprocessing; it's a critical architectural decision that determines vocabulary, granularity, and ultimately your model's capabilities.
How Tokenization Affects Everything
Vocabulary Size
Determines model memory and embedding table size
Word: 100K-1M tokens
Subword: 30K-50K tokens
Char: <100 tokens
Sequence Length
Affects computational cost and context window
Word: short sequences
Subword: medium sequences
Char: very long sequences
OOV Handling
How unknown words are processed
Word: UNK token (loses info)
Subword: decompose (preserves info)
Char: always known
Semantic Units
What meaningful chunks are preserved
Word: natural units (best)
Subword: morphemes (good)
Char: letters only (weak)
Critical Decision: Tokenization is not just preprocessing!
It fundamentally determines model architecture, performance, and behavior
Bad tokenization = Bad embeddings, no matter how good your model is
The Goldilocks Problem
Finding the right tokenization granularity is a balancing act: too coarse, too fine, or just right:
The Goldilocks Problem of Tokenization
Example: "Tokenization is preprocessing"
Too Coarse
Word-Level
Tokens:
3 tokens
Pros:
+ Natural semantic units
+ Short sequences
Cons:
- Huge vocabulary (100K+)
- Cannot handle "Tokenizations"
- OOV words become UNK
Problem: Too rigid!
Just Right ✓
Subword-Level (BPE)
Tokens:
6 tokens
Pros:
+ Balanced vocabulary (30K)
+ Handles variations
+ Decomposes unknowns
Cons:
- Slightly longer sequences
- Requires training
Best balance!
Too Fine
Character-Level
Tokens:
29 tokens
Pros:
+ Tiny vocabulary (<100)
+ No OOV ever
Cons:
- Very long sequences
- Loses semantic chunks
- Model must learn words
Problem: Too granular!
Modern NLP Solution: Subword Tokenization
Balances vocabulary size, sequence length, and OOV handling
Used by: BERT (WordPiece), GPT (BPE), T5 (Unigram)
Typical vocabulary: 30K-50K tokens
Token Length Distribution Across Strategies
Token length distribution by strategy: Character-level produces the longest sequences, word-level the shortest, subword strikes the balance.
Three Tokenization Strategies
Example Text: "Tokenization is preprocessing"
# Word-level tokenization
tokens = ["Tokenization", "is", "preprocessing"]
→ 3 tokens, simple, but what about "Tokenizations" (plural)?
# Subword-level tokenization (BPE-style)
tokens = ["Token", "ization", "is", "pre", "process", "ing"]
→ 6 tokens, handles variations like "Tokenize", "Tokenizer"
# Character-level tokenization
tokens = ["T","o","k","e","n","i","z","a","t","i","o","n"," ","i","s",...]
→ 29 tokens, handles any word but sequences are very long
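The counts above can be checked with a few lines of plain Python. The subword split is hard-coded for illustration; a real BPE tokenizer derives it from merges learned on a corpus.

```python
text = "Tokenization is preprocessing"

word_tokens = text.split()   # word-level: split on whitespace
char_tokens = list(text)     # character-level: every character, spaces included
# Illustrative BPE-style split; an actual tokenizer learns these pieces from data
subword_tokens = ["Token", "ization", "is", "pre", "process", "ing"]

print(len(word_tokens), len(subword_tokens), len(char_tokens))
# 3 6 29
```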
Quick Comparison: Which Strategy When?
Word-Level
Vocab: 100K-1M+ ✗
Seq Length: Short ✓
OOV: Poor (UNK) ✗
Semantics: Natural ✓
Use when: Controlled, small-vocabulary domains
Subword-Level ✓
Vocab: 30K-50K ✓
Seq Length: Medium ✓
OOV: Good (decompose) ✓
Semantics: Morphemes ✓
Use when: General-purpose NLP (RECOMMENDED)
Character-Level
Vocab: <100 ✓
Seq Length: Very long ✗
OOV: Perfect ✓✓
Semantics: Must learn ⚠️
Use when: Noisy text, misspellings, extreme OOV
Popular Subword Algorithms
Modern NLP uses subword tokenization. Three main algorithms:
BPE (Byte-Pair Encoding)
Algorithm:
Iteratively merge most frequent character pairs
Example:
"low", "lower", "lowest"
→ "l", "o", "w", "e", "r", "s", "t"
→ "low", "er", "est"
Used by: GPT, RoBERTa
Strength: Simple, effective, data-driven
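The BPE merge loop itself fits in a few lines of plain Python. This toy sketch (word frequencies chosen for illustration) only learns the merges; a production tokenizer also records the merge table so it can encode new text later.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of the pair with its concatenation."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word is a tuple of characters, mapped to its frequency
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
for _ in range(3):  # three merge steps
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
print(list(corpus))
# [('low',), ('lowe', 'r'), ('lowe', 's', 't')]
```

Note that the merge order depends entirely on corpus frequencies; on a large corpus, frequent suffixes like "er" and "est" emerge as units of their own.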
WordPiece
Algorithm:
Merge based on likelihood increase
Example:
Similar to BPE but uses probability scoring
Used by: BERT
Strength: Slightly better linguistic properties than BPE
Unigram
Algorithm:
Start large, prune unlikely subwords
Example:
Probabilistic: can generate multiple tokenizations
Used by: T5, XLNet
Strength: Flexible, handles ambiguity
Fragmentation comparison: Character-level severely fragments long words, while subword methods (BERT, GPT-2) balance well.
Real Tokenizer Examples
Same Text, Different Tokenizers
Text: "The unbelievable performance!"
# BERT (WordPiece):
["The", "un", "##bel", "##iev", "##able", "performance", "!"]
→ Splits "unbelievable" into subwords with ## marker
# GPT-2 (BPE):
["The", "Ġun", "bel", "iev", "able", "Ġperformance", "!"]
→ Ġ indicates a space before the token
# Character-level:
["T","h","e"," ","u","n","b","e","l","i","e","v","a","b","l","e",...]
→ Every character separate
Result: Same text → different token IDs → different embeddings!
Critical Consequence:
You MUST use the same tokenizer that was used during model pre-training. Mismatched tokenizers break embeddings!
Example: Don't use a GPT-2 tokenizer with BERT embeddings!
Try it yourself: compare tokenizers from the transformers library (BERT vs GPT-2) on the same text.
Section 10: Building a Tokenizer from Scratch
When would you build your own tokenizer, and how do you do it?
When to Build Custom Tokenizers
❓ Key Question:
Does a pre-trained tokenizer match your domain well enough?
✅ Use Pre-trained (Recommended)
When:
- General domain (news, web text)
- Well-covered language (English, etc.)
- Fast iteration needed
- Limited training data
⚠️ Build Custom (When Necessary)
When:
- Highly specialized domain
- Severe vocabulary mismatch
- Language not well-covered
- Privacy/compliance needs
Four Scenarios Requiring Custom Tokenizers
Specialized Domain
Medical/Legal/Code
Problem:
Pre-trained splits domain terms poorly
Solution:
"COVID-19" β one token
Underrepresented Language
Low-resource languages
Problem:
Existing tokenizers fragment heavily
Solution:
Train on native corpus
Vocabulary Mismatch
Severe fragmentation
Problem:
Common words become 5+ tokens
Solution:
Domain-specific vocab
Privacy/Compliance
Controlled training
Problem:
Cannot use external tokenizers
Solution:
Train on compliant data only
💡 Practical Advice
Building custom tokenizers is expensive in time and iteration. Start with pre-trained tokenizers.
Only invest in custom tokenization when measurable performance gaps exist AND domain mismatch is the root cause.
The 6-Step Process
1. Define Corpus/Domain: what data represents your task?
2. Normalization Policy: lowercase? keep punctuation?
3. Pre-tokenization Strategy: whitespace? regex? language-specific?
4. Subword Algorithm Choice: BPE, WordPiece, or Unigram?
5. Vocabulary Design: size? special tokens? reserved terms?
6. Validation Checks: coverage? OOV rate? fragmentation?
Six Critical Steps with Key Decisions
Define Corpus/Domain
What data represents your task?
✓ Good: Domain-matched corpus
✗ Bad: Generic Wikipedia for specialized domain
Normalization Policy
How to clean the text?
✓ Good: Preserve case when meaningful
✗ Bad: Lowercase everything blindly
Pre-tokenization Strategy
How to split into initial chunks?
✓ Good: Regex or language-specific
✗ Bad: Simple whitespace for all languages
Subword Algorithm Choice
Which algorithm to use?
✓ Good: BPE, WordPiece, or Unigram
✗ Bad: Word-level or char-level only
Vocabulary Design
What size and special tokens?
✓ Good: 30K-50K with domain terms
✗ Bad: Too small or too large
Validation Checks
Does it work well?
✓ Good: Check OOV, fragmentation, seq length
✗ Bad: Skip validation, use blindly
Step-by-Step Details
Step 1: Corpus/Domain Definition
# Bad: Training on Wikipedia for medical NLP
corpus = load_wikipedia()  # ✗
# Good: Domain-matched corpus
corpus = load_medical_texts()  # ✓
corpus += load_clinical_notes()
corpus += load_research_papers()
# Result: Vocabulary matches your actual use case
Step 2: Normalization Policy
# Decisions to make:
- Lowercase or preserve case?
→ "Apple" (company) vs "apple" (fruit) - case matters!
- Remove accents/diacritics?
→ "café" → "cafe"? Loss of meaning in some languages
- Handle numbers?
→ "COVID-19" → keep as-is, or normalize the digits?
- Unicode normalization?
→ Different ways to represent é (e + combining accent vs a single character)
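The last point is easy to demonstrate with Python's standard library: NFC and NFD encode the same visible text as different code-point sequences, so a tokenizer that skips normalization treats them as different strings.

```python
import unicodedata

nfc = "caf\u00e9"                        # "café" with precomposed é (U+00E9)
nfd = unicodedata.normalize("NFD", nfc)  # decomposed: "e" + combining acute (U+0301)

print(len(nfc), len(nfd))  # 4 5 - same visible text, different lengths
print(nfc == nfd)          # False - a naive tokenizer sees two different strings
print(unicodedata.normalize("NFC", nfd) == nfc)  # True - equal after normalizing
```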
Step 3: Pre-tokenization
# Simplest: whitespace splitting
"Hello world!" → ["Hello", "world!"]
# Better: regex-based
"Hello world!" → ["Hello", "world", "!"]
# Language-specific:
# Chinese/Japanese need word-aware splitting
"你好世界" → ["你好", "世界"] (not character-level!)
Step 4: Algorithm Choice
# Training a BPE tokenizer (Hugging Face `tokenizers` library)
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=30000, min_frequency=2,
                     special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["medical_corpus.txt"], trainer=trainer)
# Now you have a domain-specific tokenizer!
Step 5: Vocabulary Design
# Key decisions:
1. Vocab size:
- Too small → over-fragmentation
- Too large → rare tokens, large embedding matrix
2. Special tokens:
[PAD], [UNK], [CLS], [SEP], [MASK]
3. Reserved terms (optional):
Domain-specific entities that should NOT be split
Example (medical): "COVID-19", "MRI", "COPD"
Step 6: Validation
# Check 1: Vocabulary coverage
test_texts = load_test_set()
oov_rate = compute_oov_rate(tokenizer, test_texts)
print(f"OOV rate: {oov_rate:.2%}") # Goal: < 1%
# Check 2: Fragmentation
examples = ["unbelievable", "preprocessing", "COVID-19"]
for word in examples:
tokens = tokenizer.encode(word).tokens
print(f"{word} → {tokens}")
# Check if reasonable splits
# Check 3: Sequence length
avg_length = compute_avg_tokens(tokenizer, test_texts)
print(f"Average tokens: {avg_length}") # Goal: reasonable for model
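Check 1 can be made concrete with a toy whitespace version of `compute_oov_rate` (in practice you would count tokens your trained tokenizer maps to [UNK]; the vocabulary and texts here are made up):

```python
def compute_oov_rate(vocab, texts):
    """Fraction of whitespace-separated tokens not covered by the vocabulary."""
    total = oov = 0
    for text in texts:
        for token in text.split():
            total += 1
            if token not in vocab:
                oov += 1
    return oov / total if total else 0.0

vocab = {"patient", "has", "fever", "and", "cough"}
texts = ["patient has fever", "patient has thrombocytopenia"]
print(f"OOV rate: {compute_oov_rate(vocab, texts):.2%}")  # OOV rate: 16.67%
```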
Practical Advice:
Building custom tokenizers is expensive in time and iteration. Start with pre-trained tokenizers for general domains.
Only invest in custom tokenization when measurable performance gaps exist AND domain mismatch is the root cause.
Section 11: How Tokenizer Choices Affect Embeddings and Tasks
Tokenization isn't just preprocessing: it directly impacts embedding quality and task performance.
Four Critical Effects of Tokenization
Granularity → Semantic Cohesion
Over-fragmentation weakens meaning
✗ Bad: "unbelievable" → 12 character tokens
["u","n","b","e","l","i","e","v","a","b","l","e"]
✓ Good: "unbelievable" → 2 subword tokens
["un", "believable"]
💡 Better semantics → Better embeddings
Coverage → OOV Handling
Domain mismatch causes fragmentation
✗ Bad: General tokenizer
"thrombocytopenia" → 8 fragments
✓ Good: Medical tokenizer
"thrombocytopenia" → 1 token
💡 Better coverage → Better task performance
Length → Computational Cost
More tokens = quadratic attention cost
✗ Bad: Over-fragmented
15 tokens → 15² = 225 operations
✓ Good: Well-designed
8 tokens → 8² = 64 operations
💡 3.5x speedup + memory savings
Transfer → Pretraining Alignment
Different tokenizer = embeddings don't transfer
✗ Bad: Custom tokenizer + BERT model
Token IDs don't match → garbage!
✓ Good: BERT tokenizer + BERT model
Token IDs match → works!
💡 Token ID alignment is CRITICAL
Detailed Examples of Each Effect
Effect 1: Granularity → Semantic Cohesion
# Over-fragmentation weakens meaning
Word: "unbelievable"
# Bad tokenization (char-level):
["u","n","b","e","l","i","e","v","a","b","l","e"]
→ Model must learn from scratch that these chars = concept
# Good tokenization (subword):
["un", "believable"] or ["unbeliev", "able"]
→ Model sees morphological structure
Impact: Better semantics → better embeddings
Effect 2: Coverage → OOV Handling
# Domain: Medical NLP
Text: "Patient has thrombocytopenia"
# General tokenizer:
["Patient", "has", "th", "##rom", "##bo", "##cy", "##top", "##enia"]
→ 8 fragments! Medical term lost
# Medical tokenizer:
["Patient", "has", "thrombocytopenia"]
→ 3 tokens, medical term preserved
Impact: Better domain coverage → better task performance
Effect 3: Length → Computational Cost
# Same text, different tokenizers
Text: "The patient presented with severe symptoms"
# Over-fragmenting tokenizer:
→ 15 tokens → 15² attention matrix = 225 operations
# Well-designed tokenizer:
→ 8 tokens → 8² = 64 operations
Impact: 3.5x speedup! (plus memory savings)
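The arithmetic behind that speedup, as a one-function sketch:

```python
def attention_score_ops(n_tokens):
    # Self-attention scores every token against every other token: n^2 comparisons
    return n_tokens ** 2

fragmented, compact = attention_score_ops(15), attention_score_ops(8)
print(fragmented, compact, round(fragmented / compact, 1))  # 225 64 3.5
```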
Effect 4: Transfer → Pretraining Alignment
# Scenario: Fine-tuning BERT
# ✗ Wrong: Use a different tokenizer
custom_tokenizer = train_bpe(my_data)
bert_model = load_bert()
# Token IDs don't match → embeddings are garbage!
# ✓ Right: Use BERT's tokenizer
from transformers import AutoTokenizer, AutoModel
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_model = AutoModel.from_pretrained("bert-base-uncased")
# Token IDs match → embeddings transfer correctly
Tokenization Impact Across Different Tasks
Classification
65% Severity
Critical Factors:
- Semantic cohesion of tokens
- Domain vocabulary coverage
- Handling of rare/OOV terms
Sentiment: "not good" → needs proper boundaries
Retrieval/Search
85% Severity
Critical Factors:
- Term matching precision
- Vocabulary overlap
- Query-document alignment
Search: "COVID-19" must match exactly
Text Generation
95% Severity
Critical Factors:
- Fluency depends on boundaries
- Word formation quality
- Coherent token sequences
GPT: Poor tokenization → broken words
NER/Token Classification
95% Severity
Critical Factors:
- Entity boundaries align with tokens
- No mid-entity fragmentation
- Consistent entity representation
NER: "New York" → must stay together
⚠️ Critical Insight
Tokenization is NOT just preprocessing: it's a representation design decision
Bad tokenization → fragmented semantics → weak embeddings → poor task performance
This happens regardless of model architecture quality!
🎯 Making Practical Decisions
Theory is great, but now comes the real question: What should YOU actually do?
Direct embeddings or fine-tune?
When to use embeddings as-is vs training
Build or use pretrained?
Should you train from scratch or leverage existing models?
What to watch out for?
Common pitfalls and how to avoid them
Decision Framework
Section 12: Direct vs Fine-Tuned (decision matrix) → Section 13: Scratch vs Pretrained (when to build) → Section 14: Practical Pitfalls (what to avoid)
✅ Clear Decisions Ahead
✓ A practical decision matrix for choosing between direct embeddings and fine-tuning
✓ When building from scratch makes sense (spoiler: almost never!)
✓ Common pitfalls that waste time and how to avoid them
Section 12: When to Use Embeddings Directly
Should you use embeddings as-is, or fine-tune a full model?
Decision Matrix
| Use Case | Direct Embeddings | Fine-Tuned Model | Recommended Approach |
|---|---|---|---|
| Semantic search / RAG | ✓ Perfect fit | ✗ Overkill | Direct embeddings (sentence-transformers) |
| Clustering / Topic grouping | ✓ Fast and effective | ✗ Not needed | Direct embeddings |
| Near-duplicate detection | ✓ Cosine similarity works | ✗ Expensive | Direct embeddings |
| Simple classification (small data) | ✓ Good baseline | ⚠️ May overfit | Start with embeddings + simple classifier |
| Token-level tasks (NER) | ✗ No token boundaries | ✓ Required | Fine-tuned model |
| Generation tasks | ✗ Not applicable | ✓ Required | Full generative model |
| Complex classification (large data) | ⚠️ May plateau | ✓ Better accuracy | Fine-tune if embeddings underperform |
| Highly specialized domain | ⚠️ If pretrained fits | ✓ If domain gap large | Depends on domain mismatch severity |
Practical Heuristic
The Default Path:
- Start with direct embeddings for fast iteration
- Evaluate performance and error patterns
- Move to fine-tuning ONLY when error analysis shows representation limits
Most tasks don't need fine-tuning. Save time and compute for when it actually matters.
Section 13: Build from Scratch vs Use Pretrained
Should you train your own tokenizer and embeddings, or use pretrained?
The Decision Framework
Is the domain close to general language?
- Yes → Use Pretrained
- No → Do you have a large domain corpus?
  - No → Use Pretrained
  - Yes → Is the performance gap measurable and large?
    - No → Use Pretrained
    - Yes → Do you have the time and compute budget?
      - No → Use Pretrained
      - Yes → Consider Custom Training
When to Use Pretrained
Default to pretrained when:
- Domain is general or mainstream (news, social media, web text)
- Speed and baseline quality are priority
- Labeled data is limited
- Team lacks NLP infrastructure expertise
- Compute/time budget is constrained
Recommended models: sentence-transformers, BERT variants, GPT variants
When to Consider Custom Training
Only consider custom when:
- Domain language is highly specialized (medical, legal, scientific, code)
- Vocabulary mismatch causes severe over-fragmentation
- Compliance/privacy requires controlled training pipelines
- Long-term product value justifies maintenance cost
- Measurable performance gap exists AND domain mismatch is root cause
Custom training is expensive: requires data, compute, expertise, and ongoing maintenance.
Section 14: What to Watch Out For
Representation choices are never neutral. Here's what to evaluate:
Design Considerations
| Factor | Why It Matters | Tradeoff |
|---|---|---|
| Token granularity | Word vs subword vs char affects semantics | Coarse = simpler but less flexible; fine = flexible but longer sequences |
| Normalization rules | Case, punctuation, numbers affect meaning | Aggressive = cleaner but loses nuance; minimal = preserves signal but noisy |
| Domain vocabulary coverage | OOV tokens break semantics | General model = broad but shallow; domain model = deep but narrow |
| Sequence length | Longer sequences = more compute/memory | Long context = better understanding but slower; short context = faster but may truncate |
| Task dependency | Classification vs retrieval vs generation | Task-specific optimization vs general purpose |
Common Pitfalls to Avoid
Don't Do These:
- Fitting vectorizers on full data: Use train split only to avoid data leakage!
- Over-cleaning text: Removing "not" or punctuation can reverse sentiment
- Tokenizer mismatch: Don't use GPT-2 tokenizer with BERT embeddings
- Expecting Word2Vec arithmetic everywhere: Contextual embeddings don't work the same way
- Ignoring Unicode issues: Encoding problems create garbage tokens
Practical Guidelines
From raw text to production: concrete, actionable recommendations for building text representation pipelines.
🎯 Your End-to-End Workflow
Understand Task
Text EDA
Choose Strategy
Select Tokenizer
Preprocess
Generate Vectors
Build Baseline
Iterate!
💡 Core Principle
Start simple, iterate based on errors. Don't jump to complex solutions before understanding where simple approaches fail.
Section 15: Recommended Workflow
From raw text to production: a practical step-by-step guide.
The 8-Step Workflow
1. Understand Task: classification? retrieval? generation?
2. Perform Text EDA: length, vocabulary, quality checks
3. Choose Representation Strategy: sparse (BoW/TF-IDF) or dense (embeddings)?
4. Select Tokenizer: match to model if using pretrained
5. Apply Preprocessing: normalize, clean (minimal!), handle special cases
6. Generate Representations: vectors for training
7. Build Baseline Model: start simple
8. Iterate Based on Errors: analyze failures, improve representations (loop back to step 3 if needed)
Step-by-Step Details
Step 1 & 2: Task + EDA
# Understand your task
task = "sentiment classification"
metric = "F1-score"
# Text EDA essentials
import pandas as pd
df = load_data()
print(df.describe())
# Check text length distribution
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()
# Vocabulary richness
unique_words = set(' '.join(df['text']).split())
print(f"Vocabulary size: {len(unique_words)}")
# Class balance
print(df['label'].value_counts())
Step 3-5: Representation Strategy
# Start with simplest that might work
from sklearn.feature_extraction.text import TfidfVectorizer
# Option A: TF-IDF baseline
vectorizer = TfidfVectorizer(max_features=5000, min_df=2)
X_train = vectorizer.fit_transform(train_texts) # fit on train only!
X_test = vectorizer.transform(test_texts)
# Option B: Pretrained embeddings
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
X_train = model.encode(train_texts, show_progress_bar=True)
X_test = model.encode(test_texts)
# Compare both approaches!
Quick Start Guide: Which Representation Strategy?
TF-IDF
Sparse, Simple, Fast
When to Use:
- Small-medium datasets
- Text classification
- Need interpretability
- Limited compute
✓ Pros:
Very fast, interpretable, no pretrained model needed
✗ Cons:
No semantics, high dimensionality, OOV issues
Quick Start:
TfidfVectorizer()
💡 Best for baselines
Sentence Embeddings
Dense, Semantic, Pretrained
When to Use:
- Semantic search/RAG
- Clustering
- Similarity tasks
- Medium-large datasets
✓ Pros:
Captures semantics, low dimensionality, handles OOV well
✗ Cons:
Less interpretable, slower than TF-IDF
Quick Start:
SentenceTransformer("all-MiniLM-L6-v2")
Fine-Tuned Models
Task-Specific, Powerful
When to Use:
- NER, token classification
- Generation tasks
- Complex classification
- Large labeled datasets
✓ Pros:
Best accuracy, task-adapted, contextual
✗ Cons:
Slow training, needs labeled data, high compute cost
Quick Start:
AutoModel.from_pretrained("bert-base-uncased")
🎯 When max accuracy needed
🎯 Decision Rule
Start with a TF-IDF baseline → try sentence embeddings for semantic tasks → fine-tune only when needed
Most problems don't need fine-tuning! Embeddings work great for 80%+ of use cases.
Quick Reference: Model Selection by Task
| Task | Recommended Starting Point | When to Upgrade |
|---|---|---|
| Text classification | TF-IDF + Logistic Regression | Baseline < 80% accuracy |
| Semantic search | sentence-transformers | Rare, already near-optimal |
| Clustering | Word2Vec or GloVe averaged | Clusters not semantically coherent |
| NER / Token tasks | Fine-tuned BERT | N/A (start with full model) |
Best Practices:
- Version everything: Tokenizer, model, preprocessing pipeline
- Monitor OOV rate: High OOV = representation problem
- Check sequence length: Truncation = information loss
- Validate on held-out data: Avoid overfitting to test set
- Error analysis first: Before scaling up, understand failures
Bringing It All Together
Connect all the pieces: from raw text through embeddings to modern LLMs. See the complete picture.
The Complete Pipeline
Every step from text to predictionsβhow it all connects
Modern LLMs Connection
How everything applies to GPT, BERT, and Claude
Connecting the Dots
Section 16: Complete Pipeline (end-to-end flow) → Section 17: Modern LLMs (evolution & connection)
💡 The Big Picture
Modern LLMs didn't replace the fundamentals: they automated and scaled them. Understanding the pipeline gives you the foundation to use any NLP system effectively.
Section 16: The Full Chain - Text to Embeddings to Tasks
Every NLP system follows the same fundamental flow.
The Universal NLP Pipeline
Raw Text ("This movie was great!") → Preprocessing (normalize, clean) → Tokenization (['This', 'movie', 'was', 'great', '!']) → Token IDs ([2023, 3544, 2001, 2307, 999]) → Embeddings (lookup or generate vectors) → Model Processing (Transformers, classifiers, etc.) → Task Output (positive sentiment, 0.92 confidence)
Detailed Example: End-to-End
Sentiment Classification Pipeline
# Step 1: Raw text
text = "This movie was not great, but I still enjoyed it!"
# Step 2: Preprocessing (minimal!)
text = text.lower()  # Optional: depends on model
# Step 3: Tokenization
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize(text)
# ['this', 'movie', 'was', 'not', 'great', ',', 'but', 'i', 'still', 'enjoyed', 'it', '!']
# Step 4: Convert to Token IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
# [2023, 3544, 2001, 2025, 2307, 1010, 2021, 1045, 2145, 5632, 2009, 999]
# Step 5: Get embeddings
import torch
from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-uncased")
input_ids = torch.tensor([token_ids])  # add a batch dimension
embeddings = model.embeddings.word_embeddings(input_ids)
# Shape: [1, 12 tokens, 768 dimensions]
# Step 6: Model processing (contextual attention)
outputs = model(input_ids=input_ids)
sentence_embedding = outputs.last_hidden_state.mean(dim=1)  # Pool tokens
# Shape: [1, 768] - single vector for the entire sentence
# Step 7: Task-specific layer
import torch.nn as nn
classifier = nn.Linear(768, 2)  # 2 classes: pos/neg
logits = classifier(sentence_embedding)
prediction = torch.softmax(logits, dim=-1)
# e.g. [0.15, 0.85] → 85% positive (once the classifier head is trained)
# The model understood "not great, but...enjoyed" = overall positive!
Why Each Step Matters
| Step | Purpose | Impact if Done Wrong |
|---|---|---|
| Preprocessing | Normalize noise without losing signal | Over-clean → lose meaning; under-clean → noisy patterns |
| Tokenization | Split text into learnable units | Bad splits → fragmented semantics, OOV issues |
| Token IDs | Convert symbols to integers | Mismatched vocabulary → garbage lookups |
| Embeddings | Convert IDs to dense semantic vectors | Poor embeddings → model can't learn patterns |
| Model | Learn task-specific patterns | Wrong architecture → suboptimal performance |
| Output | Map learned patterns to task predictions | Misaligned objective → learns the wrong thing |
The Garbage In, Garbage Out Principle:
Every step builds on the previous one. Bad tokenization → bad embeddings → bad model, regardless of how sophisticated your architecture is.
This is why we spent so much time on representations!
Section 17: How This Connects to Modern LLMs
Everything we learned applies to GPT, BERT, and modern Transformers.
The Evolution Timeline
Classical Era
Sparse Vectors
Examples: BoW, TF-IDF, N-grams
✓ Works: Fast, interpretable
✗ Fails: No semantics, high dimensionality
Dense Embedding Era
Static Dense
Examples: Word2Vec, GloVe, FastText
✓ Works: Semantics! Low dimension
✗ Fails: One vector per word, polysemy
Contextual Era
Contextual Dense
Examples: BERT, ELMo, GPT-2
✓ Works: Context-aware, transfer learning
✗ Fails: Slow, max sequence length
LLM Era
Massive Scale
Examples: GPT-3/4, Claude, Llama
✓ Works: Few-shot, emergent abilities
✗ Fails: Huge compute, hallucinations
✅ What ALWAYS Stayed the Same
• Text must become numbers (tokenization still critical)
• Embeddings are learned representations (still at core)
• Quality of tokenization determines embedding quality
• Domain mismatch still hurts performance
Garbage in, garbage out principle still applies!
What Stayed the Same
Core principles unchanged:
- Text must still become numbers (tokenization + IDs)
- Models still learn through embeddings (now in embedding layers)
- Tokenizer quality still matters (BPE/WordPiece still used)
- Domain mismatch still hurts performance
- Garbage in, garbage out still applies
What Changed
Modern improvements:
- Scale: Billions of parameters, trained on trillions of tokens
- Context: Contextual embeddings by default (BERT/GPT)
- Transfer: Pre-training + fine-tuning paradigm
- Architecture: Transformers with self-attention
- Flexibility: Same model for many tasks (prompt engineering)
The Transformer Era Pipeline
Modern LLM Workflow
# Using a modern LLM (e.g., GPT or BERT)
from transformers import pipeline
# Step 1: Load pretrained model (includes tokenizer + embeddings + model)
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
# Step 2: Just pass text!
result = classifier("This movie was not great, but I still enjoyed it!")
# [{'label': 'POSITIVE', 'score': 0.92}]
# Behind the scenes:
# 1. Text → tokenizer (BPE/WordPiece)
# 2. Tokens → IDs (vocabulary lookup)
# 3. IDs → embeddings (learned embedding layer)
# 4. Embeddings → Transformer layers (self-attention)
# 5. Output → task head (classification)
# Same pipeline we learned, now automated and scaled!
💻 From Theory to Practice
You've learned the fundamentals. Now it's time to get hands-on with real code and data!
Your Learning Journey
Part 1
Foundation
Part 2
Evolution
Part 3
Mechanics
Part 4
Decisions
Part 5
Guidelines
Part 6
Integration
Part 7
Practice!
Theory complete! Now apply everything in the hands-on notebook.
What You'll Build
Text EDA
Preprocessing
Representations
What's in the Notebook
Explore the NLTK movie reviews dataset
- Length distributions
- Vocabulary analysis
- Class balance checks
- Data quality signals
Hands-on preprocessing pipeline
- Tokenization strategies
- Stopword removal
- Lemmatization
- Comparison of approaches
Compare text-to-number methods
- BoW with CountVectorizer
- TF-IDF with TfidfVectorizer
- Word2Vec embeddings
- Sentence-BERT embeddings
Learning Objectives (Revisited)
By completing the notebook, you'll be able to:
- ✓ Explain how models learn from numbers and why text needs encoding
- ✓ Perform structured text EDA before modeling
- ✓ Apply and compare preprocessing strategies with NLTK
- ✓ Convert text into multiple numeric representations
- ✓ Interpret embedding behavior with correct caveats
- ✓ Compare tokenizers (BERT vs GPT-2)
- ✓ Decide when direct embeddings are appropriate
- ✓ Describe the full chain: text → numbers → model → embeddings
Next Steps: Transformers
You're ready for the next module!
With this foundation, you can now dive into:
- Transformer architecture: Self-attention, positional encoding, encoder-decoder
- Training objectives: Masked LM, causal LM, seq2seq
- Fine-tuning strategies: Full fine-tuning vs LoRA vs prompt tuning
- Modern LLM applications: RAG, agents, tool use
You now understand what happens before the Transformer: the tokenization and embedding layers that feed into attention mechanisms.
Ready to Code?
Open the Jupyter notebook and start building your first NLP pipeline!
nltk_text_preprocessing_hands_on.ipynb